1.  PURPOSE


1.1  SCOPE 

This  report  discusses  the  work  performed  for  the  U.  S.  Army  Signal 
Research  and  Development  Laboratory  under  Contract  No.  DA  36-039-SC-90787 
during  the  period  from  1  July  1962  to  30  September  1962. 

1.2  OBJECTIVES 

The  objective  of  this  project  is  to  investigate  the  techniques  and 
concepts  of  information  retrieval  and  to  formulate  and  develop  a  general 
theory  of  information  retrieval.  The  formalization  of  this  theory  is 
oriented  to  the  automation  of  large-capacity  information  storage  and 
retrieval  systems.  This  theoretical  framework  will  be  the  basis  for 
designing  a  general  purpose  stored-program  digital  computer  system  to 
perform  the  storage  and  retrieval  functions. 

1.3  PROJECT  TASKS 

The  initial  phase  of  this  period  was  spent  in  defining  the  frame  of 
reference,  including  the  limitations  and  constraints,  for  this  research 
activity.  During  this  phase  activity  was  oriented  to  the  possibility  of 
defining  an  aspect  of  equipment  design  that  could  be  fruitfully  exploited. 
The  difficulty  encountered  in  this  approach  was  the  lack  of  a  definitive 
theoretical  concept  to  use  as  a  foundation  for  design  criteria. 

The  second  phase  was  spent  in  evaluating  existing  storage  and  retrieval 
systems,  primarily  to  discern  the  major  functional  characteristics  of  these 
systems.  As  a  result,  a  limited  number  of  basic  characteristics,  each  with 
a  small  number  of  variations,  were  isolated.  Subsequent  activity  has  been 
expended in analyzing the nature of these characteristics in terms of
rudimentary information-store, interrogate-retrieve interrelationships.

In the third phase three tasks were defined, and activity was
concentrated in these areas:

(a)  Formulation  of  General  Principles. 

(b)  Development  of  Information-Retrieval  Model. 

(c)  Development  of  Functional  Elements. 

Activity will continue in these tasks, particularly as discussed in
Section 4 of this report.
2.  ABSTRACT


This  report  discusses  research  activity  performed  in  the  investigation 
of  the  techniques  and  concepts  of  information  retrieval.  The  general 
problems  of  information  storage  and  retrieval  are  reviewed  to  establish  a 
framework for the development of general theoretical principles. A
preliminary model is presented as a medium for analyzing the functional
characteristics of the storage and retrieval problem. Specific aspects of
the problem (descriptor systems, file structures, and search procedures)
are examined, and several measures of relevance are evaluated.
3.  PUBLICATIONS,  REPORTS,  AND  CONFERENCES

3.1  TECHNICAL  NOTES 


The  following  internal  technical  memoranda  were  issued  during  this 
reporting  period: 

(a)  IEC TECHNICAL NOTE, File No. P-AA-TN-(0033)-N, 16 July 1962;
Recommendations for Research in Information Retrieval, Quentin A.
Darmstadt and Alfred Trachtenberg.

(b)  IEC TECHNICAL NOTE, File No. P-AA-TN-(0035)-N, 10 August 1962;
Review of Present Day Information Retrieval Systems, Alfred
Trachtenberg.

3.2  REPORTS 

The  following  reports  were  issued  during  this  reporting  period: 

(a)  MONTHLY LETTER REPORT NO. 1, 1 July 1962 - 31 July 1962, File No.
P-AA-TR-(0006), 3 August 1962; Research in Information Retrieval,
Alfred Trachtenberg.

(b)  MONTHLY LETTER REPORT NO. 2, 1 August 1962 - 31 August 1962,
File No. P-AA-TR-(0009), 31 August 1962; Research in Information
Retrieval, Alfred Trachtenberg.

3.3  CONFERENCES 

The following conferences were held between IEC personnel and the
Signal Corps:

(a)  5 July 1962 - Meeting at IEC. Discussions of objectives and plans
for the research activity were initiated. The formulation of a
method of approach was requested for presentation at the next
meeting.

(b)  17 July 1962 - Meeting at IEC. The memorandum referenced in
Paragraph 3.1(a) was used as the basis of discussions pertaining
to the scope, development phases, alternative plans, and
recommended direction for the project.

(c)  18 July 1962 - Meeting at IEC. Informal discussion of Signal
Corps objectives and goals for research activity.

(d)  9 August 1962 - Meeting at Fort Monmouth, New Jersey. The
memorandum referenced in Paragraph 3.1(b) was used as the basis of
discussions pertaining to functional characteristics of
information retrieval systems. No particular area of activity was
selected for further study.

(e)  10 September 1962 - Meeting at IEC. Several methods of relating
descriptor systems in a generalized sense were discussed in
relation to the requirements for a file structure. The analysis and
development of a general theory was recommended as the objective
of the project.
4.  FACTUAL  DATA


4.1  STATEMENT  OF  THE  PROBLEM

4.1.1  Original Formulation - The technical requirement of the Signal
Corps, as specified in SCL-4355, is for "...a research investigation of
techniques and concepts necessary for the efficient mechanization of
large-capacity information storage and retrieval systems." Among the
future applied objectives suggested as guides for such research are
"...problems of military significance; i.e., personnel files,
intelligence data, etc."

4.1.2  Alternative Approaches - This statement of the problem leads
to many alternative approaches that any specific research program may
take. Some of the possibilities that arise, and that have been taken in
the past, may be characterized as dichotomies:

(a)  System  oriented  versus  specific  operation  oriented. 

(b)  A  real  system  (problem)  versus  a  hypothetical  system  (problem). 

(c)  Hardware  emphasis  versus  software  emphasis. 

(d)  Reduction to canonic forms versus manipulation of canonic forms.

These dichotomies should not be construed as mutually exclusive
alternatives from which one alternative must be chosen in each instance in
order to define the research task. The following discussion explicates some
implications of emphasizing certain approaches to the program and
establishes the validity of de-emphasizing others.

4.1.2.1  System Oriented versus Specific Operation Oriented -
The need for information retrieval arises whenever an individual has a
question that he believes can be answered by referencing some pool of data;
for the present neither question nor answer is rigorously defined. In
general, however, the concern of the user of an information retrieval
system is not with any specific documentation processes but with
obtaining the information required by his question.

In this sense most current information retrieval systems, except
for those like Baseball (4) or ACSIMATIC (11), are misnamed. They are
only parts of a larger system containing many implicit operations
performed by the user; and these operations are neither formally specified
nor readily specifiable.

A system orientation to research on information retrieval would
focus on the job of providing the answers to certain kinds of questions
about certain kinds of data. Different job contexts (personnel
selection, scientific research, or intelligence analysis) deal with
different kinds of questions and different kinds of raw data organization.
As a consequence, each job generally results in a quite different operating
system if optimally designed.

A specific operation orientation to research on information
retrieval might justifiably ignore large aspects of a user's job
problems and concentrate upon improving specific operations used in many
kinds of information retrieval systems: descriptor assignment, index
organization, or search procedure. Such research might deal with a
spectrum of increasingly sophisticated approaches to specific
information retrieval procedures. In the ideal case less sophisticated
procedures might be special cases of more inclusive systematic or
theoretical formulations.
There is an important asymmetry to be considered in selecting
between these research orientations. The system orientation is directed
to the optimum use of the state-of-the-art in doing a particular job. To
the extent that state-of-the-art improvements are important in doing the
system job, some of the research effort may also be directed to
developing improved retrieval procedures. The procedure oriented approach
is concerned with improvements in the state-of-the-art and need not concern
itself with the specialized problems of any given information retrieval
system. It may be tempting to select the system oriented strategy in the
hope that unusual success may lead to state-of-the-art improvements; but
even if no breakthroughs occur, at least a usable system will result.

4.1.2.2  Real versus Hypothetical Problem - The problems for
research may be to design a system for a specific user possessing certain
operational requirements or to develop a procedure for a specific
existing information retrieval system. These alternatives are instances of
the system oriented or specific procedure oriented approaches to a real
problem, respectively. In contrast, work may proceed on the development
of a hypothetical system or the refinement of a procedure without reference
to a real system.

This dichotomy has been stated independently of the system versus
procedure alternative. In practice, however, it is more prudent to adopt
a procedure oriented research strategy in the absence of specific user
requirements. If there are no user requirements, then, in order to
maintain an artificial system orientation, energy must be diverted to the
detailed specification of hypothetical system requirements that are
virtually certain never to coincide with any specific real job.

4.1.2.3  Hardware versus Software Emphasis - This distinction is
generally familiar and requires no further definition. It is not
independent of the preceding dichotomies. To the extent that research
pertains to a real system, it is impossible to avoid detailed hardware
considerations. To the extent that a more theoretical, procedure oriented
study is being undertaken, hardware may become a secondary consideration
for future development. However, procedure oriented research in regard to
specific hardware may also be meaningful.

4.1.2.4  Reduction to Canonic Forms versus Manipulation of
Canonic Forms - In any existing automated information retrieval system
either data or question inputs (and, except for Baseball, both question
and data inputs) must be highly restricted in canonic form or format. The
selection of convenient canonic forms or formats for specific jobs requires
creative system analysis and a system orientation. There are information
retrieval system research and development programs, such as the ACSIMATIC
intelligence system or the Western Reserve Library system (9), whose major
value (or shortcoming) is based upon the specification of a new
information format for a specific job. Similarly, there are procedure
oriented studies focusing either upon the efficient manipulation of a
specific canonic form (e.g., the multi-list processing techniques of
Prywes, Gray, et al. (2,3) for manipulating data in attribute-value form)
or upon the automatic reduction of ordinary discourse to canonic form for
automatic information retrieval; the only example of the latter case is
Baseball.
4.1.3  Alternatives Selected - The original IEC position was relatively
open with regard to these alternatives. It was assumed, however, that
specific user requirements related to an eventual application of the
present research might be available. Then a major aspect of a sophisticated
automated system would involve the automated reduction of both questions
and data to canonic form; this type of system would, therefore, require
linguistic analysis. Both of these orientations have been de-emphasized
in the discussions of project objectives. The only alternative among the
dichotomies that has been clearly rejected, however, is work on a "real"
system. While hardware considerations may thus remain secondary, at least
during the early stages of the present program, it is desirable not to
restrict the project orientation to any specific retrieval procedures at
this time. Even on the question of reduction to canonical form, the only
area that has been eliminated from consideration is extensive work on
linguistic analysis rather than on more general problems such as methods
of descriptor assignment.

4.1.4  Refined Statement of the Problem - The problem as presently
conceived is to develop a general theory of information retrieval whose
primary goal is its use as a system tool for the optimum design of
specific information retrieval systems in the future. In terms of the
dichotomies, the orientation is more to procedures rather than systems,
to the hypothetical rather than the real, and to software rather than
hardware. To the extent that language analysis is de-emphasized, the
orientation is to the selection and manipulation of canonic forms rather
than to the automatic conversion of ordinary discourse to canonic forms.
In no case, however, has an extreme pole of the dichotomies been selected.
Thus, the orientation is clearly to a theory of systems that can be applied
to the design of specific job oriented systems in their entirety rather
than to a specific procedure(s) that may be valuable; to dealing with
real contexts that may be of future interest to the Signal Corps, wherever
possible, rather than necessarily limiting the study to abstract formalism;
to the consideration of optimum hardware once software at the level of
algorithm rather than machine code has been specified; and to the
problem of conversion to canonic form when linguistic complexity is not the
critical problem.

The following sections describe two aspects of the approach to this
general problem. One is the formulation of a general model of the
information retrieval process. The other is the selection of specific
problems of procedure and technique; the only example thus far is the
problem of relevance and its measurement. It is expected that the
information retrieval model will provide a framework both for understanding
the critical features of information retrieval systems of different levels
of sophistication and for isolating critical areas of information retrieval
procedures and techniques to focus upon for further developments.

4.2  SYSTEM  MODELS

4.2.1  General - The name Information Retrieval System has been
applied to a large number of systems of varying purpose and capability,
from personnel file and literature search systems to systems that retrieve
specific bits of information upon request. Outwardly, these systems seem
to operate on different principles; but if a general information retrieval
theory is possible, it must be able to show the basic similarities of
these systems. It is necessary, then, to examine the operation of each
type of system in order to develop a model that would be valid for all
systems.

Intuitively, the literature search problem with descriptor
association is a form of content retrieval or an intermediate step toward
it. At least theoretically, the process is continuous in the sense that
complete content retrieval is a limiting case of document retrieval. In
the following paragraphs this intuitive argument is developed more
rigorously within the framework of a general retrieval model.

4.2.2  Formulation of General Retrieval Model - The basis of the
classic literature search problem is: a collection of documents exists,
and a researcher desires to select from this collection a document or
documents that are pertinent to his interests. The usual approach to
this problem has been an attempt to describe the stored documents by a
small number of words or symbols and then to search through these words
or symbols until some match has been obtained with a description of the
area of interest. This process is illustrated in Figure 1.

Documents, I, are entered into the system and analyzed. On the basis
of this analysis descriptors, i (i.e., terms identifying the nature of the
document), are assigned, including a unique identifying number or
address. This analysis and descriptor assignment has traditionally been
performed by human beings, although methods for automatic descriptor
assignment have been proposed (6,7).
FIGURE 1.  Model of Literature Search System


After the descriptors have been assigned, documents are placed in the
item store in accordance with their assigned address. The complete set
of descriptors is deposited in the descriptor file in accordance with the
organization of the particular file. In the case of a library the descriptor
file is the card catalogue, and the item store is simply the shelves on
which the documents are stored.


A request for literature is translated into a set of descriptors
comparable to those that exist in the descriptor file. These descriptors
are then matched against the descriptors in the descriptor file; when
close enough matches have occurred, the addresses associated with these
matches are noted and used to locate the desired items.

This  process  may  be  written  in  symbolic  form  in  terms  of  the  symbols 
in  Figure  1  (or  Figure  2): 

(a)  Documents are described:

$i = D(I)$    (4-1)

where i is a set of descriptors; I, a document; and D, a
transformation algorithm.

(b)  Questions are posed:

$q = E(Q)$    (4-2)

where q is a set of descriptors; Q, a question; and E, a
transformation algorithm.

(c)  Question and document descriptors are matched:

$a = P(q, S_i)$    (4-3)

where a is a set of unique addresses, which may be called an
additional set of descriptors; $S_i$ is the set of all descriptors
in the descriptor file; and P is a transformation algorithm.

(d)  The desired documents are located:

$R = D_a^{-1}(a)$    (4-4)

where $D_a^{-1}$ is the inverse of the address assigning transformation
algorithm. This function might also be written more generally
as $R = D^{-1}(a)$, for $D_a$ is part of D (see Equation 4-5).
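
The four steps above can be sketched in code. The following Python fragment is only an illustration under stated assumptions: the three sample documents, the stop-word list, and the function names describe, match_descriptors, and locate are hypothetical, and simple word splitting stands in for whatever descriptor assignment a real system would use. A single function serves as both the content part of D and as E, on the assumption that queries and documents share one descriptor language.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical item store: address -> document text (the "shelves").
ITEM_STORE: Dict[int, str] = {
    1: "magnetic core storage for digital computers",
    2: "descriptor assignment for literature search",
    3: "search procedures for magnetic tape files",
}

# Assumed list of words that carry no descriptive value.
STOP_WORDS = {"for", "the", "of", "a"}


def describe(text: str) -> FrozenSet[str]:
    """D (content part) and E: reduce text to a set of descriptor terms."""
    return frozenset(w for w in text.lower().split() if w not in STOP_WORDS)


# Descriptor file: address -> descriptors (the "card catalogue").
DESCRIPTOR_FILE: Dict[int, FrozenSet[str]] = {
    address: describe(document) for address, document in ITEM_STORE.items()
}


def match_descriptors(
    q: FrozenSet[str], stored: Dict[int, FrozenSet[str]], min_shared: int = 1
) -> Set[int]:
    """P: return addresses whose descriptors share enough terms with q."""
    return {addr for addr, i in stored.items() if len(q & i) >= min_shared}


def locate(addresses: Set[int]) -> List[str]:
    """Inverse of the address assignment: fetch documents by address."""
    return [ITEM_STORE[addr] for addr in sorted(addresses)]


q = describe("magnetic storage")            # step (b): question -> descriptors
a = match_descriptors(q, DESCRIPTOR_FILE)   # step (c): descriptors -> addresses
R = locate(a)                               # step (d): addresses -> documents
```

Raising min_shared makes the "close enough" test stricter; with min_shared=2 only the first document is returned, which shows how the matching rule controls the amount of superfluous material in the response.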

An  inherent  difficulty  of  existing  literature  search  systems  is  that 
the  response  may  include  superfluous  information.  Now,  let  content 
retrieval be defined as the process for obtaining specific information,
accompanied by little or no superfluous information, in response to a
query. If Figure 1 is then restructured as shown in Figure 2, it is clear
that this primitive model is valid for literature search and such lesser
retrieval systems as personnel files. But, more important, it is also
valid for a general content retrieval system. This general retrieval
model accepts input data, transforms the data into a convenient
pre-established form, stores the data, and then selects responses from the
data on the basis of questions or requests.

FIGURE 2.  Model of General Retrieval System

4.2.3  Analysis of the Model - It is now necessary to examine each
functional section of the model represented by Figure 2 in order to
analyze their differences and to determine the requirements for particular
types of systems.


4.2.3.1  The D Transform - For the literature search system, the
D transform can be thought of as having two parts, $D_c$ and $D_a$.
Symbolically, this form of the transform becomes:

$D = (D_c, D_a)$    (4-5)

The first part, $D_c$, can be viewed as a mapping of a document onto a set
of symbols that represent the content of the document. A unique
transformation is generally not obtainable; this transformation varies with
such intangibles as the analytic viewpoint and the number of descriptors
assigned to a document. In the context of the literature search system
the most that can be said of the transformation is that, if the content
descriptors $i_c$ are processed by the processor P and the inverse transform
$D^{-1}$, at least the original document will be obtained; however, many more
documents, representing superfluous information, might also be obtained.
Symbolically, this problem is written as:

$D_c D_c^{-1} \geq 1$    (4-6)

On the other hand, the assignment of the address descriptor, $D_a$, is a
unique transformation; it uniquely identifies a particular document.
Symbolically:

$D_a D_a^{-1} = 1$    (4-7)

Thus for the whole transformation D:

$D D^{-1} = 1$    (4-8)
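
The contrast between the content part and the address part of the D transform can be illustrated with a small sketch. The sample documents and the names d_content, d_address, invert_content, and invert_address are hypothetical, and set-of-words descriptors are an assumption made only for the example. The point is that inverting the content descriptors recovers the original document plus possibly others, while inverting the address recovers exactly one.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical collection; the first two items differ as documents
# but reduce to the same content descriptors.
documents: List[str] = [
    "survey of magnetic storage",
    "magnetic storage survey",
    "survey of search procedures",
]


def d_content(text: str) -> FrozenSet[str]:
    """Content part of D: an unordered set of terms (word order is lost)."""
    return frozenset(text.lower().split()) - {"of"}


def d_address(position: int) -> int:
    """Address part of D: a unique number assigned on entry."""
    return position + 1


# Each catalogue entry pairs the unique address with the content descriptors.
catalogue: List[Tuple[int, FrozenSet[str]]] = [
    (d_address(n), d_content(text)) for n, text in enumerate(documents)
]


def invert_content(i_c: FrozenSet[str]) -> List[int]:
    """Inverse of the content part: every address with these descriptors."""
    return [addr for addr, desc in catalogue if desc == i_c]


def invert_address(a: int) -> List[int]:
    """Inverse of the address part: the single entry with this address."""
    return [addr for addr, _ in catalogue if addr == a]


by_content = invert_content(d_content(documents[0]))  # original plus one extra
by_address = invert_address(d_address(0))             # exactly the original
```

Here by_content holds two addresses and by_address holds one, which is the situation the inequality and the equality above describe.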

Descriptors are a restricted standard language. Documents and
queries are transformed into this standard language, perhaps associated
with other standard terms, and then document identifiers are retrieved
from the file. However, the number of descriptors and their richness
could be gradually expanded by specifying them as actions, relations,
[illegible], or locations. At the same time, the constantly expanding
descriptor language could be applied to sections of documents, then
subsections, then paragraphs. A system can be postulated in which the
descriptor language becomes as rich as the document or query itself (is in
fact identical with the document or query) and in which the descriptor
language applies to units of information as small as sentences and phrases.

If the length of the descriptor list for each document is extended,
a point will eventually be reached when each document is uniquely described;
i.e., it would be possible, on the basis of a given set of descriptors, to
select a particular document from the document store. At this point,
$D_c D_c^{-1} = 1$; $D_a$ then becomes redundant and can be eliminated. The
point has not yet been reached where it is possible to stop storing documents
and to rely upon their transformations (descriptors) and rules for their
recreation from these transformations. When the descriptor list is extended,
these descriptors and an appropriate set of rules can recreate the desired
document within certain limits. It may not be possible to obtain a word
for word copy of the original, but the results will duplicate the meaning
of the original.

Symbolically, this optimum system is represented by a slight
modification to Equation 4-8:

$D D^{-1} \approx 1$    (4-9)
This function means that, given a document I and the transformation
algorithm D, a set of descriptors i can be obtained. Further
transforming this set i by the inverse transformation does not yield I
exactly, but it does yield a document I' that is close to I. Since i is a
resultant of the transformation D(I) (see Equation 4-1), then:

$I' = D^{-1}(i) = D^{-1}[D(I)]$    (4-10)

where $I' \approx I$. But the kind of transformation algorithm necessary for
this system is the same as that necessary for content retrieval; any
difference may be in the format or rules of the transformation.

It may be possible to transform a document for content retrieval
into a format that is more compact and convenient than a list of
descriptors. For content retrieval, each sentence would be transformed into
a unique description that could readily be re-transformed into a close
approximation of the original.

4.2.3.2  The P Transform - The major task of the P transform is
to select sets of descriptors that have been stored in the system on the
basis of their relationship to the request descriptors. Symbolically,

$r = P(q, S_i)$    (4-11)

where r is the untransformed response. This transform can be viewed as
having two parts: a storage function that stores and relates all the
incoming descriptors and a selector function that matches the query
descriptors, q, to the set of stored descriptors, $S_i$.

For the literature search problem the selection process has two
criteria, among others, that should be noted:

(a)  To maximize the amount of relevant information obtained.

(b)  To minimize the number of irrelevant or erroneous answers.

For the content retrieval problem these criteria reduce to that of finding
an acceptable answer to a query.

Thus, it is important for the transformation process P to be
able to obtain or measure the degree of relevance of one set of
information to another. These relationship indications are especially
important for retrieval systems used for relatively uncategorized data in
which many different descriptions of the same content might be possible, a
condition that leads to difficulties in matching request and data
descriptors. One way to indicate relationships among data is the kind of
logical structure used to store information. The actual structure, however,
may not be able to indicate the strength of these relationships; i.e., the
degree of relevance or closeness among data. It may be necessary to provide
a metric for the structure to determine the strengths of these relationships
and to provide further indications of relevance, such as probabilistic
measures, that may be incorporated into the storage structure. The selector
function of P would use these relationships and their metrics as the basis
of its selection algorithms. P may then be viewed as a combination memory,
associational net, and selection mechanism.
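
One possible metric of the kind described above is the overlap between two descriptor sets. The sketch below assumes the Jaccard coefficient (shared descriptors divided by total distinct descriptors) purely for illustration; the text does not commit to this or any other measure, and the names relevance and select, the sample descriptor file, and the threshold value are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def relevance(q: FrozenSet[str], i: FrozenSet[str]) -> float:
    """Assumed metric: Jaccard overlap between query and stored descriptors."""
    union = q | i
    return len(q & i) / len(union) if union else 0.0


def select(
    q: FrozenSet[str],
    stored: Dict[int, FrozenSet[str]],
    threshold: float = 0.25,
) -> List[Tuple[int, float]]:
    """Selector part of P: keep addresses at or above the threshold,
    ordered from most to least relevant (ties broken by address)."""
    scored = [(addr, relevance(q, i)) for addr, i in stored.items()]
    kept = [(addr, score) for addr, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical stored descriptor sets, keyed by document address.
stored = {
    10: frozenset({"magnetic", "core", "storage"}),
    11: frozenset({"magnetic", "tape", "search"}),
    12: frozenset({"personnel", "files"}),
}

q = frozenset({"magnetic", "storage"})
ranked = select(q, stored)
```

A lower threshold admits more material, which favors the first selection criterion (more relevant information) at the cost of the second (fewer irrelevant answers); a higher threshold does the reverse.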

In a literature search system the i and q would generally be
descriptor lists, and P would store relationships between these
descriptors. The output, r, of the transformation would be the identifying
descriptors, usually addresses, of the relevant documents.

For a content retrieval system the i and q would have a more
complex format, but P would still be required to relate the various data
elements to each other. The output, r, would be in the same format as
the i.


4.2.3.3  The E Transform - The E transform that transforms
requests or queries into descriptor language is basically the same as
the D transform. The major difference that might exist would be that
of format, for the P transform might require a different format for the
transformed queries than for the transformed documents or input data. No
address indication would be included in the transformed query, but the same
kind of information would be indicated in the transformed query as in the
transformed document. In other words, the same or a similar kind of language
would be used for the transformed documents, $i_c$, and the transformed
queries, q. Symbolically, these relations are:

$E \sim D_c$  (for literature search)
$E \sim D$  (for content retrieval)    (4-12)

The nature of these transforms is such that there is no loss of information
in shifting from the $i_c$ format to the q format and back; i.e.:

$q = G(i_c) = G[G^{-1}(q)]$    (4-13)

where G is the appropriate one-to-one transformation. In most cases
G = 1, for q and i are expressed in the same format.


4.2.3.4  The $D^{-1}$ Transform - For the literature search the $D^{-1}$
transform is usually concerned with the addresses of documents. On the
basis of these addresses the algorithm locates the documents in a file.
The important part of this transform for the literature search is $D_a^{-1}$,
which is an ever increasing file. If the system is a content retrieval
system, then $D^{-1}$ is not a file but a set of rules, comparable to D, for
transforming the descriptor set, r, into the response, R.

4.2.4 Summary of the General Retrieval Model - The general information
retrieval model can be summarized symbolically as follows. Given a
set of documents or file items, S_I, in a retrieval system, T, a query Q
produces a response R:

     R = T[Q, S_I]                                                  (4-14)

Some of the intermediate transformations that occur in this system can
be written as:

     i = D(I)         (the descriptor assigning process)

     q = E(Q)         (the query transformation)

     r = P[q, S_i]    (the selection process)

where D, E, and P are transformation algorithms; i, q, and r are input,
query, and response descriptors; and S_i is the set of stored descriptors.
Then:

     R = D^{-1}\{P[q, S_i]\}                                        (4-15)

In terms of the original variables Q and I:

     R = D^{-1}\{P[E(Q), D(S_I)]\}                                  (4-16)

since S_i = D(S_I).
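
The chain of transformations in Equations 4-14 through 4-16 can be sketched in code for the literature-search case. This is only an illustration under stated assumptions: the word-splitting rule used for D and E, the subset test used for P, and all function names are hypothetical choices, not part of the model in this report.

```python
from typing import Dict, FrozenSet, Set

Descriptors = FrozenSet[str]


def assign_descriptors(document: str) -> Descriptors:
    """D: map a document to its descriptor set (assumed here: its lowercase words)."""
    return frozenset(document.lower().split())


def transform_query(query: str) -> Descriptors:
    """E: express the query in the same descriptor language that D produces."""
    return frozenset(query.lower().split())


def select(q: Descriptors, stored: Dict[int, Descriptors]) -> Set[int]:
    """P: return the addresses whose stored descriptors include every query descriptor."""
    return {address for address, i in stored.items() if q <= i}


def retrieve(query: str, documents: Dict[int, str]) -> Set[str]:
    """R = D^-1{ P[ E(Q), D(S_I) ] }, with D^-1 acting as an address lookup."""
    stored = {addr: assign_descriptors(doc) for addr, doc in documents.items()}  # S_i = D(S_I)
    r = select(transform_query(query), stored)                                   # r = P[q, S_i]
    return {documents[addr] for addr in r}                                       # R = D^-1{r}


docs = {
    1: "radar control systems",
    2: "information retrieval systems",
    3: "retrieval of radar data",
}
print(retrieve("Retrieval Systems", docs))
```

Here D^{-1} is a plain lookup from address to document, which matches the literature-search reading; a content retrieval system would replace that lookup with a set of rules.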

4.2.5 Specific Aspects of Retrieval Problem - Information retrieval
systems, whether actual or theoretical, are composed of many elements.
The general retrieval model highlighted three basic elements that any
usable information retrieval system must have:

(a) Descriptors or terms and their relationships.

(b) Files of data and/or terms or descriptors with an organization
or structure.

(c) Procedures for searching files and locating data or terms.

Investigations into the problems associated with each of these areas are
discussed briefly in the following paragraphs.

4.2.5.1 Descriptor Systems - Descriptors are introduced into
information retrieval systems in order to reduce the language recognition
and transformation requirements and to reduce the complexity of the
data structures or content relationships. In short, descriptors represent
an artificially restricted standard language to increase the convenience
of handling requests, constructing and organizing the computer
files, and searching for answers.

One of the major problems in constructing a descriptor system is
the proper selection of the descriptors that are class names for synonyms
so as to maximize retrieval of relevant information and minimize noise,
the retrieval of irrelevant data. The descriptors must be words in common
use, as unambiguous as possible, and sufficiently numerous to delineate
relatively fine distinctions. Obviously, the more documents filed
under a given descriptor, the larger the noise is likely to be.

To increase the number of relevant documents retrieved in
response to a given request, descriptors for the request can be weighted.
These weights can be assigned according to the relevance and the importance
of the particular descriptor under consideration. The system can
then produce responses ordered according to weights assigned descriptors
or responses greater than a fixed weight of relevance and importance.
Another scheme for reducing irrelevance in responses is to assign descriptors
to each section of documents added to the file. This method, of
course, increases the degree of content retrieval.
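
The two response modes described above (ordering by weight, and keeping only responses above a fixed weight) can be illustrated as follows. This is a minimal sketch: summing the weights of matching request descriptors is an assumed scoring rule, and the names `rank_by_weight`, `descriptor_file`, and `request` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def rank_by_weight(
    request: Dict[str, float],
    descriptor_file: Dict[str, Set[str]],
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Score each document by the summed weights of the request descriptors
    assigned to it, keep scores above the threshold, and order by score."""
    scored = []
    for doc_id, descriptors in descriptor_file.items():
        score = sum(w for term, w in request.items() if term in descriptors)
        if score > threshold:
            scored.append((doc_id, score))
    # Highest weight first; ties broken by document identifier.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


descriptor_file = {
    "doc-a": {"radar", "antenna"},
    "doc-b": {"radar", "storage", "retrieval"},
    "doc-c": {"storage"},
}
request = {"retrieval": 3.0, "storage": 2.0, "radar": 1.0}

print(rank_by_weight(request, descriptor_file))
print(rank_by_weight(request, descriptor_file, threshold=1.5))
```

Raising the threshold trades completeness for less noise, which is the same tension the preceding paragraphs describe for descriptor selection.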

Increasing the flexibility of descriptors by introducing role
indicators or specifying terms as actions, relations, results, means,
purposes, or locations is a further step toward content retrieval in
the sense that it is the beginning of syntactical and semantic specification
of request terms.

Some of the questions that must be answered before designing a
descriptor system are:

(a) What descriptive terms are likely to be needed?

(b) How specific will the requests be?

(c) Will both specific and generic queries be made?

(d) Is the same information relevant to specific and generic queries?

(e) Is the correlation of the chosen descriptors sufficiently
selective?

(f) If not, to what extent are interlocking, interfixing, and specifying
of syntactic and semantic relations necessary and
helpful?

4.2.5.2 Organization and Structure of Files - If information
retrieval is viewed generally, it can be defined as locating and presenting
a specific informative and accurate answer or piece of information
in response to a specific question. Accomplishing this function requires
a classification scheme that groups larger units of related information;
e.g., documents or sections of documents. Descriptors are assigned to
units of information. The file consists of the system of descriptors
and of information units ordered in some fashion to indicate the relations
between descriptors and information. Generally, a descriptor is
associated with many units of information and a unit of information may
be described by several descriptors. In addition, the file structure
must provide for relations among information units and among descriptors.

One of the best known systems that can be used to relate descriptors
is the hierarchical classification or tree structure originally
developed for biological classification. This type of structure forms a
Boolean algebra under the relation of class inclusion. This type of
model is appropriate only for a limited field of information in which a
class is immediately subordinate to only one other class. This restriction
requires a breakdown into small units of information, which means
that the descriptor file would be composed of a large number of hierarchies
of class inclusion. (The multilist system is a device for circumventing
the limitations of ordinary list processing or hierarchies by
allowing for relations among branches.)

For information fields of some diversity, the relations among
descriptors usually form complicated networks to which the tree theory is
not directly applicable. A general model of a complicated descriptor
network is represented by means of a complemented modular lattice. This
model is of sufficient generality to cover a wide variety of situations.
Most elements are multiply connected rather than singly connected as in a
tree. The lattice model is referred to as a weak hierarchy: an element
may have more than one predecessor. The tree is a strong hierarchy: an
element has only one predecessor. The principal problem with the lattice
model is that the number of nodes in the network quickly reaches into the
millions if all relations between descriptors are represented. Consequently,
the problem becomes one of effectively limiting the number of
relations represented among descriptors.
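
The distinction between a strong and a weak hierarchy comes down to counting predecessors. The sketch below checks only that one property for a set of broader-to-narrower relations; it does not model a complemented modular lattice, and the descriptor names and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def predecessor_map(relations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect, for each narrower descriptor, the set of its immediate predecessors."""
    preds: Dict[str, Set[str]] = defaultdict(set)
    for broader, narrower in relations:
        preds[narrower].add(broader)
    return dict(preds)


def is_strong_hierarchy(relations: Iterable[Tuple[str, str]]) -> bool:
    """True when every subordinate descriptor has exactly one predecessor (a tree)."""
    return all(len(p) == 1 for p in predecessor_map(relations).values())


tree = [
    ("vehicles", "aircraft"),
    ("vehicles", "ships"),
    ("aircraft", "helicopters"),
]
# One extra relation gives "helicopters" a second predecessor.
network = tree + [("rotating machinery", "helicopters")]

print(is_strong_hierarchy(tree))
print(is_strong_hierarchy(network))
```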

The descriptor file associates descriptors with information
units or items of data. These associations can be represented by a matrix
of ones and zeros, where descriptors may be ordered as rows and information
units as columns. A one indicates a relation; a zero, none. For a
rich information store, this matrix will be large and most of its elements
will be zeros. It is, therefore, an uneconomical representation. The
matrix can be compressed by listing rows or columns (descriptors or data)
and related items only for each entry. Of course, access to the file is
much simpler for descriptor entry. Search time for these types of files
can be reduced by using multiple entry of terms or by an ordered arrangement
of both descriptors and data. Generic relations among terms can be
shown by direct cross references, carried with each descriptor, or by a
code of hierarchical class numbers showing the generic structure of the
terms.
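
Compressing the zero-one matrix by listing only the related items for each entry can be sketched as below. The sample matrix and the function names are hypothetical; the two functions correspond to descriptor entry (one list per row) and data entry (one list per column).

```python
from typing import Dict, List, Sequence


def compress_by_descriptor(
    matrix: Sequence[Sequence[int]], descriptors: Sequence[str], items: Sequence[str]
) -> Dict[str, List[str]]:
    """Descriptor entry: for each row, list only the items marked with a one."""
    return {
        d: [items[j] for j, bit in enumerate(row) if bit]
        for d, row in zip(descriptors, matrix)
    }


def compress_by_item(
    matrix: Sequence[Sequence[int]], descriptors: Sequence[str], items: Sequence[str]
) -> Dict[str, List[str]]:
    """Data entry: for each column, list only the descriptors marked with a one."""
    return {
        item: [descriptors[i] for i, row in enumerate(matrix) if row[j]]
        for j, item in enumerate(items)
    }


descriptors = ["radar", "storage", "retrieval"]
items = ["doc-1", "doc-2", "doc-3", "doc-4"]
matrix = [
    [1, 0, 0, 1],  # radar
    [0, 1, 0, 0],  # storage
    [0, 1, 1, 0],  # retrieval
]

print(compress_by_descriptor(matrix, descriptors, items))
print(compress_by_item(matrix, descriptors, items))
```

Only the five ones are stored in either form; the seven zeros of the full matrix disappear.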


4.2.5.3 Search Procedures - In a retrieval system based upon
descriptors there are two requirements for effective search. The first
is the transformation of the request into the standard search terms. The
second is the particular strategy or methodology for searching the descriptor
file effectively and fruitfully.

Transforming a request into standard descriptor terms is basically
a form of translation from a rich language into a summary language or the
matching of two sets of terms, one large, the other smaller. In order to
accomplish this transformation, the meaning and relations between terms of
the two sets or languages must be understood. Aid may be provided in the
form of a dictionary or glossary of subject matter. The knowledge required
to transform requests into descriptors is most simply provided to a computer
by furnishing it with a thesaurus. Any more sophisticated means
would involve a considerable capability for linguistic transformation on
the part of the computer.
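
A thesaurus lookup of this kind can be sketched as a table from request words to standard descriptors. The table entries, the word-splitting rule, and the function name are hypothetical; a working thesaurus would also need to handle phrases and ambiguous words.

```python
from typing import Dict, Set


def to_descriptors(request: str, thesaurus: Dict[str, str]) -> Set[str]:
    """Map each request word found in the thesaurus to its standard descriptor;
    words with no entry are dropped."""
    words = request.lower().split()
    return {thesaurus[word] for word in words if word in thesaurus}


# Many words of the rich language map onto few descriptors of the summary language.
thesaurus = {
    "plane": "aircraft",
    "airplane": "aircraft",
    "aircraft": "aircraft",
    "locate": "detection",
    "detect": "detection",
}

print(to_descriptors("Locate the airplane", thesaurus))
```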

The formulation of a query and its transformation into a limited
set of descriptors often does not provide sufficient information and
direction to obtain exhaustive information concerning a subject that may
exist in the data file. Effective search procedures are closely related
to the way in which the descriptor file is structured and what sort of
relations are indicated there. The most common method of searching is the
conjunctive search, which retrieves only that information related to or
encompassed by all the request descriptors in conjunction. There is a
real need for investigating search procedures in terms of logical sums,
differences, complements, and more complicated combinations of these functions
as well as weighted logical functions in terms of set densities.
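
The conjunctive search and the other logical functions named above map directly onto set operations over a descriptor file. The sketch below assumes the file is held as a mapping from descriptor to the set of item numbers filed under it; the sample data and function names are hypothetical, and the weighted functions in terms of set densities are not attempted.

```python
from typing import Dict, Iterable, Set

Index = Dict[str, Set[int]]


def conjunctive(index: Index, terms: Iterable[str]) -> Set[int]:
    """Items filed under every request descriptor (logical product)."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def logical_sum(index: Index, terms: Iterable[str]) -> Set[int]:
    """Items filed under at least one request descriptor."""
    return set().union(*(index.get(term, set()) for term in terms))


def difference(index: Index, include: str, exclude: str) -> Set[int]:
    """Items filed under one descriptor but not under another."""
    return index.get(include, set()) - index.get(exclude, set())


def complement(index: Index, term: str, universe: Set[int]) -> Set[int]:
    """Items in the file that are not filed under the descriptor."""
    return universe - index.get(term, set())


index = {"radar": {1, 2, 3}, "storage": {2, 3, 4}, "retrieval": {3, 5}}
universe = {1, 2, 3, 4, 5}

print(conjunctive(index, ["radar", "storage", "retrieval"]))
print(logical_sum(index, ["radar", "retrieval"]))
print(difference(index, "radar", "storage"))
print(complement(index, "storage", universe))
```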

4.3 MEASURES OF RELEVANCE

4.3.1 General - The formulation of a query and its transformation
into a limited set of descriptors often does not provide sufficient
information and direction to obtain exhaustive information concerning
a subject that may exist in the data file. Effective search procedures
are closely related to the way in which the descriptor file is structured
and to the sort of relations indicated by the structure. An effective
information retrieval system must have the automated capability to associate
other descriptors in the system, which are applicable or relevant
to the topic in some degree, with those derived directly from the request.
Several ways of determining the degree of dependence or relevance among
descriptors have been suggested. Since this problem is a key aspect of
information retrieval, some of the schemes for measuring the association
or the relevance of terms are outlined and discussed briefly in the following
paragraphs. These schemes are also reduced to a common system of
notation to facilitate comparison.

4.3.2 Method 1 - This method is based upon the work of Fairthorne (1).
Consider a set of items that has been completely classified or categorized
under subject headings; that is, each item has been assigned to one or
more categories. These items form a Boolean algebra in which the double
complement law is valid. That is, the set of items that are not not-A's
is identical with the set of all items that are A's, where A is a category.
In a dynamic system, there will generally be items that have not
been so classified, but knowledge of their existence would be helpful to
the user. These items may not have been classified for several reasons:
their proper classification is doubtful or unknown; they are not accessible;
or, perhaps, there has been insufficient time to categorize them.
These items are now added to all the categories that might be relevant,
including all the existing categories if relevance is completely unknown.
With this classification scheme, all but not only or only but not all
items can be retrieved: the first by including items in the doubtful
category, the second by ignoring items in the doubtful category.

This concept can be expressed more formally. If the correct classification
of some items is doubtful, a system has two types of complements of
a given set of terms. These complements comprise the inclusive and exclusive
complements of sets as shown in Figure 3. The set A is represented
by the rectangle abcd plus an a priori unknown number of documents in
the rectangle behc. The exclusive complement A^e of a set A is defined
as the largest set of items that certainly does not contain any members
of A; A^e is represented by the rectangle efgh. Then, (A^e)^e is the
smallest set of items that certainly contains all the members of A;
namely, the rectangle aehd. The inclusive complement A' is defined as
the smallest set of items that certainly contains all the items that are
not members of A; clearly, this set is the rectangle bfgc. (A')' is the
largest set that certainly contains only elements of A. Thus (A')' is
the rectangle abcd. Documents of ambiguous or doubtful classification
will be elements of (A^e)^e. When their proper classification has been
resolved, they become elements of (A')'.

FIGURE 3. Inclusive and Exclusive Complements of Sets

(Diagram: three adjacent rectangles with upper corners a, b, e, f and
lower corners d, c, h, g. Rectangle abcd holds the items certainly in A,
rectangle behc the doubtful items, and rectangle efgh the items
certainly not in A.)

LEGEND:  A               = Set under consideration
         A^e             = Exclusive complement of A
         (A^e)^e         = Exclusive complement of A^e (all but not only)
         A'              = Inclusive complement of A
         (A')'           = Inclusive complement of A' (only but not all)
         (A^e)^e - (A')' = Doubtful A
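
The four sets of Figure 3 can be computed directly once the certain and doubtful members of a category are kept apart. This is a minimal sketch assuming that split is available; the function name, the dictionary keys, and the sample item numbers are hypothetical.

```python
from typing import Dict, Set


def complements(universe: Set[int], certain: Set[int], doubtful: Set[int]) -> Dict[str, Set[int]]:
    """Inclusive and exclusive complements of a category A whose membership
    is split into certain items and doubtful items."""
    exclusive = universe - certain - doubtful   # A^e: certainly contains no member of A
    all_but_not_only = universe - exclusive     # (A^e)^e: certainly contains every member of A
    inclusive = universe - certain              # A': certainly contains every non-member of A
    only_but_not_all = universe - inclusive     # (A')': certainly contains only members of A
    return {
        "A^e": exclusive,
        "(A^e)^e": all_but_not_only,
        "A'": inclusive,
        "(A')'": only_but_not_all,
    }


universe = set(range(1, 10))
certain = {1, 2, 3}      # rectangle abcd
doubtful = {4, 5}        # rectangle behc
sets = complements(universe, certain, doubtful)

print(sets["(A^e)^e"])                  # "all but not only" retrieval
print(sets["(A')'"])                    # "only but not all" retrieval
print(sets["(A^e)^e"] - sets["(A')'"])  # doubtful A
```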

Define the distance, d, between two sets as the number of elements
in their symmetric difference. That is:

     d(A,B) = |(A - B) \cup (B - A)|                                (4-17)

This definition has the properties that a good definition of distance
should have. In particular, it satisfies the axioms for distance in a
metric space.
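
Equation 4-17 and the metric axioms it satisfies can be checked on small examples. The sketch below is illustrative only: the topic sets are hypothetical, and an exhaustive check over three sets demonstrates the axioms for these sets rather than proving them in general.

```python
from itertools import product
from typing import Set


def distance(a: Set[int], b: Set[int]) -> int:
    """Number of elements in the symmetric difference of two sets (Equation 4-17)."""
    return len((a - b) | (b - a))


topics = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
}

for x, y, z in product(topics.values(), repeat=3):
    assert distance(x, y) == distance(y, x)                   # symmetry
    assert (distance(x, y) == 0) == (x == y)                  # zero only for identical sets
    assert distance(x, z) <= distance(x, y) + distance(y, z)  # triangle inequality

print(distance(topics["A"], topics["B"]))
print(distance(topics["A"], topics["C"]))
```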

This concept can be applied to the classification scheme just discussed.
The interpretation of distance in this case is the remoteness
or irrelevance of two topics. There are two distances corresponding to
the two complements, as illustrated by Figure 4. The inclusive distance
is the least set of items that certainly includes all items that belong
to one but not both of the sets. This set is represented by the area of
the rectangles abcd, efgh, and jkhc in Figure 4. The exclusive distance
is the largest set of items that certainly belongs to one of the sets
but not both; that is, the rectangles abcd plus efgh. Obviously, the
inclusive distance is always greater than or equal to the exclusive
distance. The difference between the two distances, namely the rectangle
jkhc, measures the current uncertainty about the relevance of the two
topics in a particular system. Documents of uncertain classification
are in the set (A^e)^e - (A')'. This point is evident in Figure 3.

FIGURE 4. Inclusive and Exclusive Distance Between Sets

LEGEND:  Inclusive distance between A and B: abcd + efgh + jkhc
         Exclusive distance between A and B: abcd + efgh
         Measure of uncertainty of relevance of A and B: jkhc


4.3.3 Method 2 - A second measure of distance between topics is
adapted from Klingbiel (5) and is a modification of the first. This
measure is a normalized version of Equation 4-17. Method 1 produces
inordinately large distances for large sets. The purpose of the second
method is to obtain a measure that is more independent of the number
of elements in the set. The modified definition is:

     d(A,B) = \frac{|A \cup B - A \cap B|}{|A \cup B|}
            = 1 - \frac{|A \cap B|}{|A \cup B|}                     (4-18)


4.3.4 Method 3 - The third method is adapted from Mooers (8).
Information concerning a given topic can be thought of as a conjunction
of applicable descriptors. The closeness of two topics can be measured
by a comparison of weighted descriptors that the two topics have in common.
The descriptors of the system can be identified with an ordered
sequence of binary bits. Each descriptor is represented by a position
in the binary number. If the descriptor is applicable to a certain topic
A, then a 1 appears in that position; otherwise a 0. Each position is
also assigned a positive weight, w_j (for the j-th bit), indicating the
importance or degree of relevance of that descriptor to the topic. The
distance d between two topics A and B can then be defined as:


                                                                    (4-19)


where a_j and b_j are the bits of the respective ordered descriptor
numbers. This definition requires that there be at least one descriptor
common to the two topics. An anomaly of this definition is that it does
not satisfy the axioms of distance in a metric space. In particular, it
is not necessarily the case that d(A,C) \le d(A,B) + d(B,C).


4.3.5 Method 4 - This method, which has been discussed by Watanabe
(10), considers a probabilistic model for the association of terms. It
associates either descriptors or items on the basis of the correlation
among them. The relationship between items and descriptors is presented
in the form of a matrix. In this matrix each element represents the
assignment or non-assignment of a descriptor to an item. The item-descriptor
matrix, T, is then defined as an m by n matrix whose element
T(x_i, y_j) of the i-th row and j-th column is 1 or 0, according to whether
item x_i does or does not have descriptor y_j.

Consider now a large collection of items, X = (x_i), i = 1,2,...,m,
with a set of descriptors, Y = (y_j), j = 1,2,...,n. The probability that
an arbitrary item has the description b_1, b_2, ..., b_n, which is an ordered
sequence of bits representing the applicability of the n descriptors, is
the ratio of the number of rows with the proper bit pattern to the total
number of rows in the matrix. This probability is expressed by:

     p(b_1, b_2, ..., b_n) =
          \frac{1}{m} \sum_{i=1}^{m} \prod_{j=1}^{n} \delta[b_j, T(x_i, y_j)]     (4-20)

where \delta is the Kronecker delta, so that \delta(a, b) = 0 if a \ne b,
and \delta(a, b) = 1 if a = b.
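
Equation 4-20 amounts to counting the rows of T that match a bit pattern in every position. A minimal sketch follows, with a hypothetical four-item, three-descriptor matrix; the function name is an assumption.

```python
from typing import Sequence


def pattern_probability(T: Sequence[Sequence[int]], pattern: Sequence[int]) -> float:
    """Fraction of rows of the item-descriptor matrix T whose bits equal the pattern.

    The inner all() plays the role of the product of Kronecker deltas over j,
    and the outer sum runs over the m rows.
    """
    m = len(T)
    matches = sum(1 for row in T if all(b == t for b, t in zip(pattern, row)))
    return matches / m


# Rows are items x_1..x_4; columns are descriptors y_1..y_3.
T = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]

print(pattern_probability(T, [1, 0, 1]))
print(pattern_probability(T, [0, 0, 0]))
```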

For the collection of items the uncertainty about the description of
an arbitrarily selected object can be measured by an entropy function, S(Y):

     S(Y) = -\sum p(Y) \log p(Y)                                    (4-21)

where the summation is extended over two values, 0 and 1, for all the b's
corresponding to Y. Similar entropy functions can be defined over subsets
of descriptors, Y_k, such that:

     S(Y_k) = -\sum p(Y_k) \log p(Y_k)                              (4-22)

with the summation extending over the two values of all the b's corresponding
to Y_k.

An information theoretical measure of correlation can be defined for
a set of descriptors Y_k with respect to its disjoint subsets
Y_1, Y_2, ..., Y_\theta by:

     C(Y_k; Y_1, Y_2, ..., Y_\theta) =
          \sum_{\mu=1}^{\theta} S(Y_\mu) - S(Y_k)                   (4-23)

where the Y_\mu are disjoint and complete subsets of Y_k so that any element,
y_j, of Y_k belongs to one and only one of these subsets. The correlation,
C, may be considered as a generalization of the information function.

The total correlation in Y_k, C_t(Y_k), can be considered as the redundancy
existing in Y_k among its elements, y_j \in Y_k. The total correlation
then is:

     C_t(Y_k) = \sum_j S(y_j) - S(Y_k)                              (4-24)

Of course, the total correlation in Y is simply:

     C_t(Y) = \sum_j S(y_j) - S(Y)                                  (4-25)

C_t(Y) is the largest of all possible C(Y; Y_1, Y_2, ..., Y_\theta).
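
The entropy of Equation 4-21 and the total correlation of Equation 4-25 can be computed from an item-descriptor matrix as sketched below. The logarithm base is not stated in this report, so base 2 is an assumption here, as are the function names and sample matrices. Bit patterns that never occur contribute nothing to the sum, so only observed patterns are counted.

```python
from collections import Counter
from math import log2
from typing import Sequence


def entropy(T: Sequence[Sequence[int]], columns: Sequence[int]) -> float:
    """S over the chosen descriptor columns: -sum p log2 p over observed bit patterns."""
    m = len(T)
    counts = Counter(tuple(row[j] for j in columns) for row in T)
    return -sum((c / m) * log2(c / m) for c in counts.values())


def total_correlation(T: Sequence[Sequence[int]]) -> float:
    """C_t(Y): sum of single-descriptor entropies minus the joint entropy S(Y)."""
    n = len(T[0])
    return sum(entropy(T, [j]) for j in range(n)) - entropy(T, list(range(n)))


redundant = [[0, 0], [1, 1]]                     # second descriptor repeats the first
independent = [[0, 0], [0, 1], [1, 0], [1, 1]]   # every pattern equally likely

print(total_correlation(redundant))
print(total_correlation(independent))
```

Two descriptors that are always assigned together carry one bit of redundancy, while two descriptors assigned independently carry none.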

Correlation between two descriptors, y_k and y_k', can be broken into
two parts, similarity and dissimilarity:

     C(y_k, y_k') = C'(y_k, y_k') + C''(y_k, y_k')                  (4-26)

Similarity, C'(y_k, y_k'), is a measure of the number of times y_k and
y_k' are jointly assigned or not-assigned to the same item. Dissimilarity,
C''(y_k, y_k'), is a measure of the number of times y_k and y_k' are
oppositely assigned to the same item. Similarity can then be expressed by:

     C'(y_k, y_k') = \sum_{b_k = b_k'} p(b_k, b_k')
          \log \frac{p(b_k, b_k')}{p(b_k) p(b_k')}                  (4-27)

And dissimilarity can be expressed by:

     C''(y_k, y_k') = \sum_{b_k \ne b_k'} p(b_k, b_k')
          \log \frac{p(b_k, b_k')}{p(b_k) p(b_k')}                  (4-28)

In this description only the correlation between descriptors has been
indicated. If, however, items are considered instead of descriptors, the
correlation, similarity, and dissimilarity of objects may be measured by
the same formulae.

4.3.6 Utility of Measures - The utility of these measures of association,
distance, and similarity lies in the fact that they provide an automatic
means of relating request descriptors to other descriptors and
relating documents to other documents or information. For example, a
request descriptor could be given and the system would be asked to retrieve
all information under descriptors with a similarity to the given descriptor
(in the sense of Method 4) greater than some prescribed number. This
process can appropriately be called concept retrieval. Note that concept
retrieval can be applied to either content retrieval or document (partial
content) retrieval. This notion of concept must possess a kind of continuity,
namely that a small change in the set of objects under consideration
must produce only a small change in the concept. Otherwise, the
definition is clearly not in accord with intuition. The other definitions
of distance can be used in a similar fashion to assist in obtaining
relevant descriptors and/or to retrieve information ordered according to
relevance.
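
Concept retrieval with a prescribed cutoff can be sketched as follows. As a stand-in for the Method 4 similarity, this example uses one minus the normalized set distance of Method 2, computed over the sets of items filed under each descriptor; that substitution, the sample file, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Set


def normalized_distance(a: Set[int], b: Set[int]) -> float:
    """Symmetric difference divided by union (the normalized distance of Method 2)."""
    union = a | b
    if not union:
        return 0.0
    return len(a ^ b) / len(union)


def related_descriptors(given: str, index: Dict[str, Set[int]], min_similarity: float) -> List[str]:
    """Descriptors whose similarity to the given descriptor exceeds the prescribed number."""
    base = index[given]
    return sorted(
        d for d, filed in index.items()
        if d != given and 1.0 - normalized_distance(base, filed) > min_similarity
    )


def concept_retrieve(given: str, index: Dict[str, Set[int]], min_similarity: float) -> Set[int]:
    """Items under the given descriptor plus items under every related descriptor."""
    result = set(index[given])
    for d in related_descriptors(given, index, min_similarity):
        result |= index[d]
    return result


index = {
    "radar": {1, 2, 3, 4},
    "antenna": {2, 3, 4, 5},
    "detection": {1, 2, 3},
    "storage": {6, 7},
}

print(related_descriptors("radar", index, 0.5))
print(concept_retrieve("radar", index, 0.5))
print(concept_retrieve("radar", index, 0.7))
```

Lowering the cutoff widens the concept gradually, which is in keeping with the continuity requirement stated above.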

The measures outlined here will not be evaluated further in this
report except to state that two types of evaluation are possible. The
first is the theoretical adequacy of a definition and its implications.
The second is the ultimate evaluation test, namely the utility of the
definitions in terms of actual use in retrieving information in an operational
information retrieval system. That is, does the concept in practice
effectively assist in the retrieval of information judged to be relevant
to the request by the requestor?


4.4 REFERENCES

(1) Fairthorne, R. A.; "Delegation of Classification,"
American Documentation; Volume 9, March 1953.

(2) Gray, H. J., et al; Information Retrieval and the Design
of More Intelligent Machines; Final Report No. AD59URI
to the U. S. Signal Corps, July 1959.

(3) Gray, H. J., et al; The Multi-List System; Report to
the Office of Naval Research, Information Systems Branch,
under Contract NOnr 551(40), November 1961.

(4) Green, B. F., Wolf, A. K., Chomsky, Carol, and Laughery,
K.; "Baseball: An Automatic Question-Answerer," Proceedings
WJCC; IRE, Los Angeles, May 1961.

(5) Klingbiel, P. H.; Language Oriented Retrieval Systems,
(AD 271-600); February 1962.

(6) Luhn, H. P.; Auto-Encoding of Documents for Information
Retrieval Systems; IBM Research Center, Yorktown Heights,
New York.

(7) Maron, M. E.; Automatic Indexing: An Experimental Inquiry,
(AD 245-175); RAND Corporation, Santa Monica, California,
10 August 1960.

(8) Mooers, C. N.; The Use of Symbols in Information Retrieval,
RADC-TN-59-133, (AD 213-782); April 1959.

(9) Perry, J. W., Kent, A., and Berry, M. M.; Machine Literature
Searching; New York, 1956.

(10) Watanabe, S.; A Probabilistic View of the Formation of
Concept and of Association; presented at the annual meeting
of the AAAS, 26-30 December 1961.

(11) Personal communications and informal briefing.


5. CONCLUSIONS

Four aspects of the research orientation have been described: system-procedure,
real-hypothetical, hardware-software, reduction-manipulation.
A theoretical (procedural, hypothetical, software, manipulative) approach
is being taken. A preliminary generalized model has been formulated, and
some of its implications have been considered. One procedural area, the
measurement of relevance, has been formally elaborated. Further work on
the functional characteristics of a general theory of information retrieval,
the development of the model, and the formal consideration of additional
procedures and techniques is required.




6. PLANS FOR THE NEXT QUARTER


Activities during the next quarter will proceed with the over-all
goal of developing a theory of information retrieval for use as a tool
in the design of information retrieval systems. Work will include at
least the following three aspects of the development of such a theory.

(a)  A  statement  of  the  necessary  or  desirable  features  of  a  theory 
of  information  retrieval  together  with  a  breakdown  of  the 
essential  functional  elements  of  information  retrieval  and 
their  interrelationships. 

(b) Continue development of an information retrieval model based
upon Item (a) and the models described in this report. This
work will use and relate the results of Item (c).

(c) Continue work on functional elements of the model and techniques
that are applicable to the effective performance of
these essential functions (e.g., measures of relevance as
applied to descriptor assignment).

These three aspects of the work are actually levels of detail. The
first provides a general statement of the objectives of the research,
defines essential areas of effort, and provides guidelines and definitions
for use in the development of the theory. The second level of
effort develops and defines the essential features of the theory to the
point where a representative model is meaningful. It will isolate independent
functions and establish relations between functions that are not
independent. The third level develops detailed techniques, procedures,
and methodology useful for the design of an effective information
retrieval system.

systems. Previous experience includes the analysis, evaluation, and
design of complex radar, control, and defense systems.

7.2.3 Quentin A. Darmstadt - AB, Mathematics, Oberlin College, 1950;
advanced studies in mathematics and mathematical logic, Harvard University,
and New York University. Activities center upon developing logical and
mathematical proofs leading to the formulation of algorithms for solving
problems on electronic digital computers. Experience includes operational
analysis and evaluation of systems.

7.2.4 George Greenberg - BA, Psychology, Brooklyn College, 1955;
PhD, Psychology, Duke University, 1960. Activities include psychological
research in learning, psycholinguistics, and perception. Previous
experience includes the organization of research into the automation of
command languages.